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Technological advances in genotyping have given rise to hypothesis- 
based association studies of increasing scope. As a result, the scientific 
hypotheses addressed by these studies have become more complex 
and more difficult to address using existing analytic methodologies. 
Obstacles to analysis include inference in the face of multiple compar- 
isons, complications arising from correlations among the SNPs (single 
nucleotide polymorphisms), choice of their genetic parametrization 
and missing data. In this paper we present an efficient Bayesian model 
search strategy that searches over the space of genetic markers and 
their genetic parametrization. The resulting method for Multilevel 
Inference of SNP Associations, MISA, allows computation of multi- 
level posterior probabilities and Bayes factors at the global, gene and 
SNP level, with the prior distribution on SNP inclusion in the model 
providing an intrinsic multiplicity correction. We use simulated data 
sets to characterize MISA's statistical power, and show that MISA 
has higher power to detect association than standard procedures. Us- 
ing data from the North Carolina Ovarian Cancer Study (NCOCS), 
MISA identifies variants that were not identified by standard methods 
and have been externally 'validated' in independent studies. We ex- 
amine sensitivity of the NCOCS results to prior choice and method 
for imputing missing data. MISA is available in an R package on 
CRAN. 
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1. Introduction. Recent advances in genotyping technology have re- 
sulted in a dramatic change in the way hypothesis-based genetic associa- 
tion studies are conducted. While previously investigators were limited by 
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costs to investigating only a handful of variants within the most interesting 
genes, researchers may now conduct candidate-gene and candidate-pathway 
studies that encompass many hundreds or thousands of genetic variants, of- 
ten single nucleotide polymorphisms (SNPs). For example, the North Car- 
olina Ovarian Cancer Study (NCOCS) (Schildkraut et al (2008)), an on- 
going population-based case—control study, genotyped 2129 women at 1536 
SNPS in 170 genes on 8 pathways, were 'pathway' is defined as a set of genes 
thought to be simultaneously active in certain circumstances. 

The analytic procedure most commonly applied to association studies of 
this scale is to fit a separate model of association for each SNP that adjusts 
for design and confounder variables. As false discoveries due to multiple 
testing are often a concern, the level of significance for each marginal test 
of association is adjusted using Bonferroni or other forms of false discov- 
ery correction (Storey, 2002; Wacholder, 2004; Balding, 2006). While these 
methods have been shown to be effective in controlling the number of false 
discoveries reported, correlations between the markers may limit the power 
to detect true associations (Efron, 2007). The NCOCS study provides a 
case-in-point. When simple marginal methods are applied to the NCOCS 
data, no SNPs are identified as notable. 

Marginal SNP-at-a-time methods do not address directly many of the sci- 
entific questions in candidate pathway studies, such as 'Is there an overall 
association between a pathway and the outcome of interest?' and 'Which 
genes are most likely to be driving this association?' The Multilevel Infer- 
ence for SNP Association (MISA) method we describe here is designed to 
simultaneously address these questions of association at the level of SNP, 
gene, and pathway. 

MISA, in contrast to the marginal methods, identifies ten SNPs of inter- 
est in the NCOCS study. To date, one of these (ranked tenth by MISA) 
has been validated in external data by a large multi-center consortium 
(Schildkraut et al, 2009); additional testing is underway for other top SNPs 
discovered by MISA. To buttress this empirical evidence, we demonstrate 
using simulation studies (Section 4) that MISA has higher power to detect 
associations than other simpler procedures, with a modest increase in the 
false discovery rate (Figure 1). 

In the next section, we describe the Bayesian hierarchical model behind 
MISA and highlight how it addresses many of the key issues in analy- 
sis of SNP association studies: identification of associated SNPs and ge- 
netic models, missing data, inference for multi-level hypotheses and con- 
trol of the false discovery rate. Like stepwise logistic regression (Balding, 
2006), lasso (Park and Hastie, 2008; Shi et al, 2007; Wu et al, 2009) and 
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logic regression (Ruczinski et al, 2003; Kooperberg and Ruczinski, 2004; 
Schwender and Ickstadt, 2007), MIS A improves upon marginal, SNP-at- 
a-time methods by modeling the outcome variable as a function of a mul- 
tivariate genetic profile, which provides measures of association that are 
adjusted for the remaining markers. MISA uses Bayesian Model Averaging 
(Hoeting et al, 1999) to combine information from multiple models of as- 
sociation to address the degree to which the data support an association at 
the level of individual SNPs, genes, and pathways, while taking into account 
uncertainty regarding the best genetic parametrization. By using model av- 
eraging, MISA improves upon methods that select a single model, which 
may miss important SNPs because of LD structure. We show how the prior 
distribution on SNP inclusion provides a built-in multiplicity correction. 
Because missing data are a common phenomenon in association studies, we 
discuss two options for handling this problem. 

In Section 3, we present an Evolutionary Monte Carlo algorithm to effi- 
ciently sample models of association according to their posterior probabili- 
ties. In Section 4 we apply our method to simulated data sets and demon- 
strate that MISA outperforms less complex and more commonly used al- 
ternatives for detecting associations in modestly powered candidate-gene 
case-control studies. The simulation approach may also be used to guide se- 
lection of the prior hyper-parameters given the study design. In Section 5 we 
return to the NCOCS study and present results from the analysis of a single 
pathway from that study. We examine the sensitivity of results to prior hy- 
perparameter choice and methods for imputing missing data. We conclude 
in Section 6 with recommendations and a discussion of future extensions. 

2. Models of Association. We consider SNP association models with 
a binary phenotype, such as presence or absence of a disease as in case- 
control designs. For i = 1,... ,n, let Di indicate the disease status of in- 
dividual i, where D{ = 1 represents a disease case and Di = represents 
a control. For each individual, we have S SNP measurements, where SNP 
s is either homozygous common (A S A S ), heterozygous (a s A s or A s a s ), ho- 
mozygous rare (a s a s ), or missing and is coded as 0, 1, 2, representing the 
number of rare alleles, or NA if the SNP is missing for that individual. We 
will discuss methods for imputing missing SNP data in Section 2.3. In ad- 
dition to the SNP data, for each individual we have a g-dimensional vector 
zf of design and potential confounding variables that will be included in all 
models, henceforth referred to as 'design' variables. 

We use logistic regression models to relate disease status to the design 
variables and subsets of SNPs. We denote the collection of all possible models 
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by 3Vt. An individual model, denoted by 3VC 7 , is specified by the S dimen- 
sional vector 7, where 7 S indicates the inclusion and SNP-specific genetic 
parametrization of SNP s in model UVC-y : 7 S = if SNP S ^ 3VC 7 , 7 S = 1 if 
SNP S G M 7 with a log-additive parametrization, 7 S = 2 if SNP S € M 7 
with a dominant parametrization, and 7 S = 3 if SNP S 6 3Vt 7 with a re- 
cessive parametrization. When no homozygous rare cases or controls are 
observed, we fix the genetic parametrization to be log-additive. Under each 
of these genetic parametrizations, SNP s may be encoded using one degree 
of freedom. In particular, for the log-additive model, the design variable 
representing SNP s is a numeric variable equal to the number of copies of 
the risk allele a s . For the dominant model, we use an indicator variable of 
whether allele a s is present (homozygous rare or heterozygous) and for the 
recessive model, an indicator variable of whether SNP s has the homozygous 
rare genotype. For each individual, the logistic regression under model 3VC 7 
assuming complete data is given by 

(2.1) logit(p(A = l|zj,x 7i ,0 7 ,M 7 )) = a + zf a + x 7 ^/3 7 

where x 7i represents the coding of SNPs in model M 7 and # 7 is the vec- 
tor of model specific parameters (ao, a T , /3 7 T ), with intercept ao, vector of 
design variable coefficients a, and log-odds ratios /3 7 . Prospective models 
for disease outcome given multivariate genetic marker data as in equation 
(2.1) provide measures of association that are adjusted for other markers 
which can increase the power to detect associations (Balding, 2006), how- 
ever, one is faced with an extremely large collection of possible models. 
While stepwise selection methods may be used to select a single model 
(Cordell and Clayton, 2002), this leads to difficulty in interpreting the sig- 
nificance of SNPs in the selected model. Bayesian model averaging is an 
alternative to stepwise selection methods and is an effective approach for 
identifying subsets of likely associated variables, for prioritizing them and 
for measuring overall association in the presence of model uncertainty (see 
the review articles by Hoeting et al. (1999) and Clyde and George (2004) 
and references therein). 

2.1. Posterior Inference. Posterior model probabilities measure the de- 
gree to which the data support each model in a set of competing models. 
The posterior model probability of any model JVt 7 in the space of models 
M is expressed as 



MULTILEVEL INFERENCE FOR SNP ASSOCIATION STUDIES 



5 



where p(D j 3VC~) is the (marginal) likelihood of model 3Vt 7 obtained after 
integrating out model-specific parameters # 7 with respect to their prior 
distribution, and p(3VC 7 ) is the prior probability of 3Vt 7 . 

While posterior probabilities provide a measure of evidence for hypothe- 
ses or models, it is often difficult to judge them in isolation as individ- 
ual model probabilities may be "diluted" as the space of models grows 
(Clyde, 1999; George, 1999; Clyde and George, 2004). Bayes factors (BF) 
(Kass and Raftery, 1995) compare the posterior odds of any two models (or 
hypotheses) to their prior odds 

BF( M 71 ; m 72 ) _ ^\ D \^l D) 

and measures the change in evidence (on the log scale) provided by data 
for one model, 3Vt 71 , to another, 3Vt 72 or for pairs of hypotheses. Goodman 
(1999) and Stephens and Balding (2009) provide a discussion on the useful- 
ness of Bayes factors in the medical context and Wakefield (2007) illustrates 
their use in controlling false discoveries in genetic epidemiology studies. Be- 
low we define Bayes factors for quantifying association at multiple levels 
(global, gene, and SNP) and assessing the most likely SNP-specific genetic 
parametrization. 

2.1.1. Global Bayes Factor. The Bayes factor in favor of Ha, the alter- 
native hypothesis that there is at least one SNP associated with disease, to 
Hq, the null hypothesis that there is no association between the SNPs under 
consideration and disease, measures the relative weight of evidence of Ha to 
Hq. The null model corresponding to Hq is the model which includes only 
design variables and no SNPs, and is denoted Mo- The alternative hypothe- 
sis is represented by all of the remaining models in 3VC. Because the space of 
models is large, the null model (or any single model in general) may receive 
small probability (both prior and posterior), even when it is the highest pos- 
terior probability model (this illustrates the dilution effect of large model 
spaces); Bayes factors allow one to judge how the posterior odds compare 
to one's prior odds. 

The Global Bayes factor for comparing Ha to Hq may be simplified to 

(2.2) BF(H A :H )= ]T BF(M 7 : M )p(M 7 | H A ) 

which is the weighted average of the individual Bayes factors BF(3VC-y : Mo) 
for comparing each model in Ha to the null model with weights given by the 
prior probability of M 7 conditional on being in Ha, p(M 7 j Ha)- Because 
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the alternative is a composite hypothesis, the resulting Global Bayes factor 
is not independent of the prior distribution on the models that comprise the 
alternative, thus the prior distribution on models will play an important role 
in controlling the (relative) weights that models of different sizes receive. 
For a large number of SNPs, it is impossible to enumerate the space of 
models and posterior summaries are often based on models sampled from 
the posterior distribution. In equation (2.2), if we replace the average over 
all models in Ha with the average over the models in S (the collection of 
unique models sampled from the posterior distribution), the result 



is a lower bound for the Bayes factor for testing global association. If the 
lower bound indicates evidence of an association, then we can be confident 
that this evidence will only increase as we include more models. 

2.1.2. SNP Bayes Factors. While it is of interest to quantify association 
at the global level, interest is primarily in identifying the gene(s) and vari- 
ants) within those genes that drive the association. We begin by defining 
SNP inclusion probabilities and associated Bayes factors. These marginal 
summaries are adjusted for the other potentially important SNPs and con- 
founding variables and provide a measure of the strength of association at 
the level of individual SNPs. Given each sampled model JVt-y € S and the 
model specification vectors 7 = (71,72, • • ■ , 75) previously defined in Section 
2, the inclusion probability for SNP s is estimated as: 



where p(3Vt-y | D,S) is the posterior probability of a model re-normalized 
over the sampled model space. The SNP Bayes factor is the ratio of the 
posterior odds of the SNP being associated to the prior odds of the same, 
and is defined as: 



BF(H A : H ) >BF S (H A :H )= £ BF(M 7 : M )p(M 7 | Ha) 



(2.3) 



p( 7 ,^0|Z>)= £ l( 7s ^o) PpVt 7 I D,§) 



BF( 7s + : 7s = 0) 



pin, ± 1 d) . P { ls + 0) 

p{ ls = 0\D) ' p{ ls = 0) ' 



where p{^ s 7^ 0) is the prior probability of SNP s being associated. Estimates 
of the SNP Bayes Factor may be obtained using the estimated SNP inclusion 
probabilities from (2.3). 
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2.1.3. Gene Bayes Factors. In cases where there are SNPs in Linkage 
Disequilibrium (LD), SNP inclusion probabilities may underestimate the 
significance of an association at a given locus. This occurs because SNPs in 
LD may provide competing explanations for the association, thereby diluting 
or distributing the probability over several markers. Since the amount of 
correlation between markers across different genes is typically negligible, 
calculating inclusion probabilities and Bayes factors at the gene level will 
not be as sensitive to this dilution. A gene is defined to be associated if one 
or more of the SNPs within the given gene are associated. Hence we define 
the gene inclusion probability as: 

p(r fl = l|£>)= Y, l(r 9 =i)P(M 7 |D,S); 

where T g = 1 if at least one SNP in gene g is in model 3VC-y and is zero 
otherwise. The gene Bayes factor is defined as: 

where p(T g = 1) is the prior probability of one or more SNPs in gene g being 
associated. 

2.1.4. Interpreting Evidence. Jeffreys (1961, page 432) presents a de- 
scriptive classification of Bayes factors into " grades of evidence" (reproduced 
in Table 1) to assist in their interpretation (see Kass and Raftery (1995)). In 
the context in which he presents the grades, he defined the Bayes factor as- 
suming equal prior odds, making it equivalent to posterior odds and enabling 
a meaningful interpretation in terms of probabilities. It is not clear whether 
he intended his descriptive grades to be used more broadly for interpreting 
Bayes factors or for interpreting posterior probabilities. 

Jeffreys was well aware of the issues that arise with testing several sim- 
ple alternative hypotheses against a null hypothesis (Jeffreys, 1961, Section 
5.04), noting that if one were to test several hypotheses separately that by 
chance one might find one of the Bayes factors to be less than one even if 
all null hypotheses were true. He suggested that, in this context, the Bayes 
factors needed to be "corrected for selection of hypotheses" by multiplying 
by the prior odds. 

Experience has shown that detectable SNP associations are relatively in- 
frequent, hence the prior odds of any given SNP being marginally associ- 
ated in the typical genetic association study should be small. For this reason, 
Stephens and Balding (2009) suggest that marginal Bayes factors calculated 
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assuming equal prior odds be interpreted in light of a prior odds more ap- 
propriate to the study at hand. Our approach to the problem of exploring 
multiple hypotheses is to embed each of the potential submodels (corre- 
sponding to a subset of SNPs) into a single hierarchical model. Unlike the 
marginal (one-at-a-time) Bayes factors in Stephens and Balding (2009) that 
are independent of the prior odds on the hypotheses, our SNP Bayes factors 
are based on comparing composite hypotheses and hence do depend on the 
prior distribution over models, which implicitly adjusts for the selection of 
hypotheses. 

While Bayes factors do not provide a measure of absolute support for or 
against a hypothesis (except with even prior odds) , the log Bayes factor does 
provide a coherent measure of how much the data change the support for 
the hypothesis (relative to the prior) (Lavine and Schervish, 1997). Applying 
Jeffreys grades to Bayes factors using priors distributions that account for 
competing hypotheses provides an idea of the impact of the data on chang- 
ing prior beliefs, but ultimately posterior odds provide a more informative 
measure of evidence and model uncertainty. 



Grade 


BF (H A ■ Ho) 


Evidence against Hq 


1 


1 to 3.2 


Indeterminate 


2 


3.2 to 10 


Positive 


3 


10 to 31.6 


Strong 


4 


31.6 to 100 


Very Strong 


5 


> 100 


Decisive 



Table 1 

Jeffrey's grades of evidence (Jeffreys, 1961, page 432). 



2.2. Prior Distributions, Laplace Approximations and Marginal Likeli- 
hoods. We assume normal prior distributions for the coefficients 6^ with a 
covariance matrix that is given by a constant 1/k times the inverse Fisher 
Information matrix. For logistic regression models, analytic expressions for 
p(D | DVC-y) are not available and Laplace approximations or the Bayes In- 
formation Criterion are commonly used to approximate the marginal likeli- 
hood (Raftery, 1986; Wakefield, 2007; Burton et al, 2007). Using a Laplace 
approximation with the normal prior distribution (see Appendix A), the 
posterior probability of model JVLy takes the form of a penalized likelihood 

(2.4) p(M-y | D) oc exp{-i[dev(M 7 ; D) + pen(M 7 )]} 

where dev(M n ,;D) = — 21og(p(D | 7 ,M 7 )) is the model deviance, and 
the penalty term pen(M-y) encompasses a penalty on model size induced 
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by the choice of k in the prior distribution on coefficients 0-y and the prior 
distribution over models. Because we expect that effect sizes will be small, we 
calibrate the choice of k based on the Akaike information criterion (Appendix 
A), leading to 

pen(M 7 ) = 2(1 + g + s 7 ) - 2 log(p(M 7 )). 

2.3. Missing Data. The expression in (2.4) assumes complete data on all 
SNPs. Missing SNP data, unfortunately, are the norm rather than the ex- 
ception in association studies. Removing all subjects with any missing SNP 
genotype data will typically result in an unnecessary loss of information 
and potential bias of estimated effects if the missing data are non-ignorable. 
It is possible, however, to exploit patterns in LD to efficiently impute the 
missing genotypes given observed data (Balding, 2006). We use fastPHASE 
(Stephens et al, 2001; Servin and Stephens, 2007) to sample haplotypes and 
missing genotypes (G m ) given the observed unphased genotypes (G°). This 
assumes that the pattern of missing data is independent of case-control sta- 
tus, which, if not true may lead to serious biases (Clayton et al, 2005). This 
assumption may be examined by using indicator variables of missingness as 
predictors in MISA. 

The posterior probabilities of models given the data are obtained by av- 
eraging the marginal likelihood of a model over imputed genotype data: 

p(M-, | D) oc J exp{-i[dev(M 7 ;D,G°,G m ) + pen(M 7 )]}p(G m | G°)dG m 
1 1 1 

( 2 - 5 ) ~ m H ex P{-^y0^y;G°,G^) +pen(M 7 )]} = $(^) 

i=l 

where / is the number of imputed data sets, dev(3VCy; D, G°, G m ) is the 
deviance based on the completed data, and ^(M-y) is an estimate of the 
un-normalized posterior model probability for model M-y. We have found 
that the number of imputed sets must be on the order of I = 100 to provide 
accurate estimates of posterior quantities. This has a significant computa- 
tional impact in the model search algorithm described in Section 3. As a 
simple alternative, we approximate (2.5) by a modal approximation, where 
the missing genotypes are imputed with the mode of the sampled genotypes 
using fastPHASE. While it is well known that plugging in a single estimate 
for the missing data under-estimates uncertainty, the modal approximation 
provides dramatic computational savings. In Section 5 we examine the sen- 
sitivity of results to the method of imputing missing data and find that the 
modal approximation gives comparable results for SNP BFs. 
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2.4. Choice of Prior Distribution on Models. The prior distribution on 
the space of models M, p(M-y), completes our model specification. The fre- 
quentist approach for SNP association studies usually involves some form of 
adjustment for multiple-testing, which can, in effect, penalize the researcher 
who looks beyond single-SNP models of association to multiple SNP models 
or models of interactions. Under the Bayesian approach, posterior evidence 
in the data is judged against the prior odds of an association using Bayes 
factors, which should not be affected by the number of tests that an inves- 
tigator chooses to carry out (Balding, 2006). 

While it has been common practice to adopt a "non-informative" uniform 
distribution over the space of models for association (this is after marginal- 
izing over the possible genetic models for each SNP), this choice has the 
potentially undesirable "informative" implication that ^ of the SNPs are 
expected to be associated a priori, and the prior odds of at least one SNP 
being included (which is used in the global Bayes factor) depends on the 
number of tests (2 s ) (Table 2). 

A recommended alternative is the Beta-Binomial distribution on the model 
size, which provides over-dispersion, added robustness to prior misspecifica- 
tion, and multiplicity corrections as a function of the number of variables 
(Ley and Steel, 2009; Scott and Berger, 2008; Cui and George, 2008). We 
construct a hierarchical prior distribution over the space of models defined 
by subsets of SNPs and their genetic parametrizations as follows. For any 
SNP included in the model, we assign a uniform distribution over the pos- 
sible genetic parametrizations. The prior distribution on the model size 
s-y is Bin(5, p) conditional on p, and for the last stage, p is assigned a 
Beta (a, b) distribution. Integrating over the distribution on p, leads to the 
BetaBinomial(a, b) distribution on model size, 

_ B(s^ + a,S-s^ + b) 

K ' ' PK 7; (5 + 1)B(* 7 + 1, S - s 7 + l)B(a, b) 

and the following distribution on models, 

'lyi B(s y + a,S-s^ + b) 



(2J > " (M - V3/ B(a, b ) 

where B(-,-) is the beta function and the factor of 1/3 accounts for the 
distribution over genetic parametrizations. 

2.4.1. Default Hyper- Parameter Choice. Following Ley and Steel (2009) 
and Scott and Berger (2008), we recommend a = 1 as a default, so that 
the prior distribution on model size is non-increasing in s~. The hyper- 
parameter b can then be chosen to reflect the expected model size, global 
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prior probability of at least one association, or the marginal prior odds that 
any SNP is associated (Table 2). A default choice is to set 6 = 1, leading to a 

Binomial Beta-Binomial Beta-Binomial 
(5,1/2) (1,1) (l.AS) 

Expected Model Size f (oo) § (oo) (i) 

Global Prior Odds (oo) S (oo) ± 

of an Association 

Marginal Prior Odds 1 1 Xs (°) 

of an Association 

Prior Odds of Adding a Variable 1 |*±i (0) (a+i^s-^-i (°) 

Table 2 

General prior characteristics and limiting behavior (in parentheses) of the Bin(5*, 1/2), 
BetaBinomial(l, 1) and BetaBinomial(l, AS) distribution on model size. 



uniform distribution on model size (Ley and Steel, 2009; Scott and Berger, 
2008). Like the binomial distribution, the BetaBinomial(l, 1) distribution re- 
sults in an expected model size of ^ (Table 2), although the BetaBinomial(l, 1) 
distribution has a larger variance than the Bin(S', 1/2). Alternatively, if b is 
proportional to S, b = XS the expected model size approaches a limit of y 
as S approaches infinity. 

The choices for hyperparameters have implications for the global Bayes 
factor. The BetaBinomial(l, 1) has a global prior odds of association equal 
to the number of SNPs, S, and would be appropriate for the case where 
increasing the number of SNPs under consideration reflects increased prior 
certainty that an overall (global) association can be detected. Under the 
BetaBinomial(l, XS), the global prior odds are constant, 1/A, reflecting a 
prior odds for overall association that is independent of the number of 
genes/SNPs tagged. Also, with both Beta-Binomial prior distributions, the 
prior odds of incorporating an additional SNP in any model decreases with 
model size and approaches in the limiting case as the number of SNPs, 
S, increases. This provides an implicit multiple testing correction in the num- 
ber of SNPs (rather than tests) that are included in the study of interest. 
The BetaBinomial(l, XS) achieves this by keeping the global (pathway) prior 
odds of an association constant while decreasing the marginal prior odds of 
any one of the SNPs being associated as the number of SNPs increases. As a 
skeptical "default" prior, we suggest the hyper-parameters a = 1 and b = S 
which leads to the global prior odds of there being at least one association 
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of 1 and the marginal prior odds of any single SNP being associated of 1/5. 

3. Stochastic Search for SNPs. Given the number of SNPs under 
consideration, enumeration of all models for S greater than 25-30 is in- 
tractable. While it is possible to enumerate all single variable SNP mod- 
els, the number of models with 2 or 3 SNPs allowing for multiple genetic 
parametrizations is in the millions or more for a typical modern hypothesis- 
oriented study. Stochastic variable selection algorithms (see Clyde and George, 
2004, for a review) provide a more robust search procedure than stepwise 
methods, but also permit calculation of posterior probabilities and Bayes 
factors based on a sample of the most likely candidate models from the 
posterior distribution. 

MISA makes use of a stochastic search algorithm based on the Evolu- 
tionary Monte Carlo (EMC) algorithm of Liang and Wong (2000). EMC is 
a combination of parallel tempering (Geyer, 1991) and a genetic algorithm 
(Holland, 1975) and samples models based on their "fitness". While origi- 
nally designed to find optimal models based on AIC, in our application the 
fitness of the models is given by ^>(3VC~) 

rJ>(My) = log(#(M 7 )) 

where ^(JVt-y) is defined in equation (2.5) and is equal to the log of the un- 
normalized posterior model probability. This results in models being gener- 
ated according to their posterior probability. 

The EMC algorithm requires that we specify the number of parallel chains 
that are run and the associated temperature for each chain that determines 
the degree of annealing. If the temperatures are too spread out for the num- 
ber of chains, then the algorithm may exhibit poor mixing and slow con- 
vergence. Liang and Wong (2000) show that even with all chains run at a 
temperature of 1 (no annealing), EMC outperforms alternative sampling 
methods such as Gibbs sampling and Reversible Jump MCMC in problems 
where strong correlations among the predictor variables lead to problems 
with exploring multiple modes in the posterior distribution. We have found 
that a constant temperature ladder with 5 parallel chains provides good mix- 
ing and finds more unique models than using a custom temperature ladder 
based on the prescription in Liang and Wong (2000), and recommend the 
constant temperature ladder as a default. To assess convergence, we take 
two independent EMC runs using randomly chosen starting points and ex- 
amine trace plots of the fitness function. We use the marginal likelihoods 
from the set of unique models in the sample for inference and compute esti- 
mates of marginal posterior inclusion probabilities for each run. We continue 
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running the two instances of the EMC algorithm until the posterior proba- 
bilities derived from each are sufficiently close. This leads to longer running 
times than those suggested by conventional convergence diagnostic such as 
Gelman-Rubin (Gelman and Rubin, 1992). 

Efficiency of stochastic algorithms often diminishes as the total number 
of models increases. For this reason, we have found it useful to reduce the 
number of SNPs included in the EMC search using a screen when S is large. 
Such a screen will typically be fairly permissive, leaving only the weak- 
est candidates out of the stochastic search. The screen should be quick to 
calculate, adjust for the same design variables and consider the same ge- 
netic parametrizations as in the full analysis. In our analyses, we calculated 
marginal (i.e. SNP-at-a-time) Bayes Factors for each of the log-additive, 
dominant and recessive models of association against the model of no asso- 
ciation. We ordered SNPs according to the maximum of the three marginal 
Bayes factors and retained those with a maximum marginal BF greater than 
or equal to one. More details are available in Appendix B. 

4. Simulation Comparison. We used the 124 simulated case - con- 
trol data sets, details of the simulation can be found in Appendix C, to 
estimate true and false positive rates for MISA and seven other alternative 
procedures: 

Bonferroni We fit a logistic regression model for each SNP under the log- 
additive parametrization and calculate the p-value for testing associ- 
ation using a Chi-Squared test. We use a Bonferroni corrected level 
a = 0.05 test to declare a SNP associated. 

Adjusted Bonferroni We fit a logistic regression model for each SNP un- 
der the log-additive parametrization and calculate the p-value for test- 
ing association using a Chi-Squared test. We use a Bonferroni corrected 
level a test to declare a SNP associated where a is chosen so that the 
proportion of false positives detected is the same as in MISA default 
(1) and custom (2). 

Benjamini-Hochberg We fit the same SNP-at a time logistic regression 
as above, but declare a SNP to be associated if it has a Benjamini- 
Hochberg false discovery rate of less than 0.05. 

Marginal BF This also utilizes the single SNP at a time logistic regression, 
but calculates a BF for association under each of the three genetic 
models. If the maximum BF over the three genetic models is greater 
than 3.2, we declare the SNP associated. See Appendix C for more 
detail. 

Stepwise LR (AIC) We use a stepwise multiple logistic regression proce- 
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dure to select SNPs based on AIC. Each SNP is coded using 2 degrees 
of freedom to select among the three genetic models. SNPs in the final 
model are called associated. 

Stepwise LR (BIC) Same as above but using BIC to select models. 

Lasso We use the Lasso2 package in R (Lokhorst et al, 2009) that is based 
on the algorithm developed by Osborne et al. (1999) to select SNPs 
based on the least absolute shrinkage and selection operator. Each 
SNP is coded using 2 degrees of freedom to represent the three genetic 
models and all SNPs in the final model with coefficients greater than 
zero are called associated. 

MISA We reduced the number of SNPs using the marginal Bayes factor 
method above to eliminate SNPs with a marginal BF > 1. We ran 
MISA using the default BetaBinomial(l, S) prior distribution on the 
models using two runs of 400,000 iterations based on convergence of 
the marginal inclusion probabilities. SNPs are called associated if their 
MISA SNP BF is greater than 3.2. All SNPs that did not pass the 
marginal screen step in MISA were declared not associated. 

The first four are single SNP methods, while the last three are multi-SNP 
methods that take into account the genetic parametrization for each SNP. 

Figure 1 shows the proportion of SNPs detected by each of the methods 
as a function of the assumed true odds ratio. Thus, at an odds ratio of 1.00 
we plot the proportion of SNPs that were falsely declared associated by 
each of the methods. While both Bonferroni and Benjamini-Hochberg have 
the smallest false positive rates, they have much lower power to detect true 
associations than any of the other methods; the marginal BF has the highest 
power out of the three marginal methods, and is comparable to lasso, a multi- 
SNP method. Stepwise model selection using BIC has the lowest power of 
the multiple SNP model selection procedures. Stepwise logistic regression 
using AIC to select a model, on the other hand, has high power to detect 
associations, but an unacceptably high false positive rate (44%). With the 
exception of stepwise/ AIC, the MISA methods have higher power than the 
alternatives at all odds ratios (ORs) in the simulation, with the gain in power 
most noticeable for the smaller ORs, those encompassing the range, 1.25 - 
1.75 typically seen in practice (Flint and Mackay, 2009). This increase in 
power comes at the cost of only a slight increase in the false positive rate. 
Overall, MISA using the default BetaBinomial(l, S) prior distribution is able 
to detect 9% as many associations at the SNP level and 13% as many at 
the gene level than the marginal BF method used alone. In addition, MISA 
is able to detect 19% as many true associations at the SNP level and 27% 
as many at the gene level as the calibrated Bonferroni method (the two 
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Fig 1. True and False positive rates of MISA versus alternative 
methods. 

methods have the same Type I error rate). 

4.1. Sensitivity to Hyperparameters. We examined a range of parameters 
(a and b) for the Beta-Binomial prior distribution on model size (Table 3) 
to assess sensitivity of true positive and false positive rates. In practice, this 
may be done by reweighting the MCMC output using the new prior distribu- 
tion, without resorting to additional MCMC runs, as long as high posterior 
probability models receive adequate support under both prior distributions. 

Over the range of values for (a, b), MISA has a higher gene and SNP true 
positive rate than any of the other simpler procedures, with the exception 
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of Stepwise AIC. In general, decreasing a leads to higher true positive rates, 
but at the expense of higher false positive rates. The SNP false positive rate 
is modest, ranging from 0.025 to 0.099, providing effective control of the 
experiment wide error rate. While these rates are higher than the false posi- 
tive rates under Bonferroni or Benjamini-Hochberg, eliminating a SNP from 
consideration that truly is associated has a higher scientific cost than con- 
tinuing to collect data to confirm that a SNP is really a null finding. Because 
the NCOCS will follow-up apparent associations, a higher true positive rate 
with a modest increase in false positives was preferable. 

The hyper-parameters a = 1/8 and b = S, highlighted in bold in Table 3 
were selected for comparison with the default choice (a = 1, b = S) in the 
analysis of the NCOCS data presented in the next section. MISA using the 
BetaBinomial(l/8, S) is able to detect 19% as many true associations at the 
SNP level and 26% as many at the gene level as the marginal BF method 
used alone. In addition, MISA with the BetaBinomial(l/8, S) prior is able 
to detect 14% as many true associations at the SNP level and 24% as many 
at the gene level as a calibrated Bonferroni method, (the two methods have 
the same Type I error rate). 

5. Ovarian Cancer Association Analysis. In this section, we de- 
scribe a MISA candidate pathway analysis of data from the ongoing NCOCS 
ovarian cancer case-control association study. The NCOCS is a population 
based study that covers a 48 county region of North Carolina (Schildkraut et al, 
2008). Cases are between 20 and 74 years of age and were diagnosed with pri- 
mary invasive or borderline epithelial ovarian cancer after January 1, 1999. 
Controls are frequency matched to the cases by age and race and have no 
previous diagnosis of ovarian cancer. In the analysis we present, we focus on 
self-reported Caucasians and a specific histological subtype of the cancer, 
leaving us a total of 397 cases and 787 controls. Because the ovarian cancer 
results have not yet been published, we have anonomyzed the pathway, the 
genes chosen to represent it and the IDs of the SNPs tagging variation in 
those genes. The pathway is comprised of 53 genes tagged by 508 tag SNPs. 

All models fit in the screen and by MISA included the patient's age as 
a design variable. We used the modal approximation to fill in missing SNP 
data. We screened 508 SNPs using marginal Bayes factors, retaining S = 70 
SNPs that exceeded the threshold of 1 in favor of an association. Using the 
default hyperparameters a = 1 and b = S, we ran two independent runs 
of the algorithm from independent starting points for a total of 1.2 million 
iterations — the point at which the SNP marginal inclusion probabilities 
from the two independent runs were determined to be in sufficiently close 
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Table 3 

Estimated overall false and true positive rates with standard errors and posterior odds 
(PO) of association at the gene and SNP levels. The values in bold characterize the 
method selected for use in the analysis of the NCOCS ovarian cancer example. 



agreement. 

On basis of this analysis, we estimate a lower bound on the pathway- 
wide Bayes factor for association to be V>¥(Ha '■ Hq) = 7.67 (which is 
also the posterior odds for this prior). This constitutes "positive" evidence 
in favor of an association between the pathway and ovarian cancer based 
on Jeffreys' grades of evidence and corresponds to a posterior probability 
that the pathway is associated of roughly 0.89. Figure 2 summarizes the 
associations of the ten SNPs that had a SNP BF greater than 3.2, while 
Figure 3 illustrates the seven genes that contained these SNPs and two 
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others that received comparable support. SNPs and genes in the pathway are 
denoted by a two level name (e.g. SI and Gl) where the number represents 
the rank of the SNP or gene by its respective Bayes factor. These plots 
provide a graphical illustration of the top 100 models TVLy € 3VC selected on 
basis of their posterior model probabilities. Models are ordered on the x— 
axis in descending probability and the width of the column associated with a 
model is proportional to that probability. SNPs (Figure 2) or genes (Figure 
3) are represented on the y-axis. The presence of a SNP or gene in a model 
is indicated by a colored block at the intersection of the model's column and 
the SNP's or gene's row. In Figure 2, the color of the block indicates the 
parametrization of the SNP: purple for log-additive, blue for recessive and 
red for dominant. The "checkerboard" pattern (as opposed to the presences 
of more vertical bars) suggests substantial model uncertainty. 

The top five models depicted in Figure 2 include only a single SNP in 
addition to age at diagnosis (the design variable is omitted in the figure 
as it is included in all models). The top model includes SNP SI in gene 
Gl under the log-additive genetic parametrization, which is estimated to 
have an odds ratio (OR) of approximately 1.42 (the posterior mode). The 
second ranked model includes only SNP S2 in gene Gl under the log-additive 
genetic parametrization with an estimated OR of 1.37. Note that the study 
has relatively low power to detect effects of this magnitude (Figure 1). 

Figure 2 also illustrates that many of the top models beyond the first five 
include multiple SNPs. This suggests that if we were to restrict our attention 
to single SNP models we would potentially lose substantial information re- 
garding their joint effects. For example, model six is comprised of both SNP 
S3 from gene G3 and SNP S2 from gene Gl, while model 12 is comprised of 
both SNP S3 from gene G3 and SNP SI from gene Gl. In both cases, SNP 
S3 is included in models with a SNP from gene Gl. This may indicate that 
not only are SNPs SI, S2, and S3 important as single effects in the top four 
models, but that their combined effects may be of interest. Note that, in 
cases where the disease variant is unmeasured but 'tagged,' several tagged 
SNPs may be required to explain variation at that locus. 

The SNP Bayes factors of SI (BF = 42.2) and S2 (BF = 17.8) provide 
"strong evidence" of changes in prior beliefs, however, the marginal poste- 
rior probabilities of association with ovarian cancer are 0.38 and 0.20, re- 
spectively. Figure 2 illustrates that when one of SNP SI or S2 is included in 
a model, the other is often not (at least in the top 50 models). This trade off 
often arises when SNPs are correlated (i.e. in high linkage disequilibrium). 
In this case, R 2 of 0.5 suggests fairly strong LD between SNPs SI and S2, 
in which case the joint inclusion probabilities are more meaningful than 
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Fig 2. Image plot of the SNP inclusion indicators for the SNPs with marginal Bayes 
factors greater than 3.2 and the top 100 Models. The color of the inclusion block corre- 
sponds to the genetic parametrization of the SNP in that model. Purple corresponds to 
a log-additive parametrization, red to a dominant parametrization and blue to a recessive 
parametrization. SNPs are ordered on basis of their marginal SNP Bayes Factors which are 
plotted on the right axis across from the SNP of interest. Width of the column associated 
with a model is proportional to its estimated model probability. 
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Fig 3. Image plot of the gene inclusion indicators for the top 100 Models. Genes are 
ordered based on their marginal gene Bayes Factors which are plotted on the right axis. 
Columns correspond to models and have width proportional to the estimated model proba- 
bility, models are plotted in descending order of posterior support. The color is chosen to 
be neutral since the genetic parametrizations are not defined at the gene level. 

marginal probabilities. Both SNP 1 and SNP 2 are in gene Gl which has a 
gene Bayes factor of 31.95 (Figure 3) and posterior probability of association 
of 0.58. These probabilities need to be interpreted in the context of model 
uncertainty; conditional on the pathway being associated with ovarian can- 
cer, the probability that gene Gl is driving the association is 0.58/0.89 = 
0.65. However, there remains substantial uncertainty regarding which genes 
and SNPs may explain it as the posterior mass is spread over competing 
models/hypotheses. The positive support for an association suggests the 
continuation of data accrual to refine these posterior probabilities. 

Gene Gl and other genes in Figure 3 highlight a caution regarding the 
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interpretation of Bayes factors as a measure of absolute support with com- 
posite hypotheses. The gene Bayes factor for Gl is 31.95, which is smaller 
than the SNP Bayes factors for SI (42.2). The posterior probability that 
gene Gl is associated is based on summing the probabilities of all models 
that include at least one SNP from that gene (SI, S2, and S51) hence the 
posterior probability for gene inclusion is always greater than or equal to 
the probability that any one SNP is included (i.e. posterior probabilities 
observe a monotonicity property with composite hypotheses). Bayes factors 
(and p-values) for composite hypotheses do not share this monotonicity 
property (Lavine and Schervish, 1997). Bayes factors for comparing com- 
posite hypotheses may be expressed as the ratio of the weighted average 
(with respect to the prior distribution) of marginal likelihoods conditional 
on the hypotheses, which may decrease the evidence in favor of a composite 
hypothesis when a subset of the individual hypotheses have low likelihood. 
As mentioned in Section 2.1.4, while Bayes factors do not provide a coher- 
ent measure of absolute support because of their non-monotonicity property, 
Lavine and Schervish (1997) show that the log Bayes factor does provide a 
coherent measure of how much the data change the support for the hy- 
pothesis (relative to the prior). Hence, they do provide useful summaries of 
changes in prior beliefs of association in large association studies with many 
competing models/hypotheses. 
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Table 4 

Analysis of variance for the ranked SNP Bayes factors contrasting the prior 
hyper-parameters (default a = 1 versus a = 1/8) and method of imputation (full 
imputation with 100 data sets versus a modal estimate of the missing genotypes) for the 
70 SNPs in the NCOCS pathway that passed the marginal screen. 



5.1. Sensitivity Analysis. In this section, we consider sensitivity of the 
results in the NCOCS study to the prior distribution on the models and to 
the method of imputation. The simulation study suggests that priors with 
smaller values of a may identify more associated SNPs. We estimated that 
the BetaBinomial(l/8, S) prior distribution on model size has a false posi- 
tive rate comparable to the marginal BF method, but a much higher true 
positive rate, in the scenarios we considered. Full data imputation, achieved 



22 



WILSON, IVERSEN, CLYDE, SCHMIDLER & SCHILDKRAUT 



by averaging over the distribution of missing SNPs, is probabilistically cor- 
rect, but computationally expensive. Thus, if the use of modal imputation 
provides an accurate approximation to BF calculated using full imputation, 
the computational efficiency of MIS A can be greatly improved at small cost. 

For purposes of this analysis, we used the set of unique models iden- 
tified by the EMC search with modal imputations and a = 1 and calcu- 
lated 3 additional sets of BFs. First, we obtained marginal likelihoods for 
each of these models using 100 imputed data sets with missing SNPs filled 
in based on their estimated distribution. Second, we calculated BFs using 
the BetaBinomial(l/8, S) and BetaBinomial(l, S) prior distributions using 
the marginal likelihoods under the full and modal imputations. We applied 
AN OVA to these four sets of BFs to compare the effects of prior hyperpa- 
rameters and imputation methods after adjusting for SNP using the ranked 
SNP BFs. 1 

Table 4 shows that the method of imputation has no significant effect on 
the ranking of SNP BFs. This suggests that, for purposes of model search 
and calculation of BFs, we may use the modal imputed genotypes in place 
of full imputation, with significant computational savings. For purposes of 
parameter estimation, we suggest the use of full imputation using a subset 
of the top models and top SNPs as using a plug-in approach for imputation 
is known to underestimate uncertainty. 

We anticipated that the prior distribution would have a significant effect 
based on the higher true positive and false positive rates estimated from 
the simulation study and by considering differences in the prior odds. While 
Table 4 suggests that overall the rankings are different between the two 
prior distributions, the top 20 SNPs have the same rank under each of the 
four methods, leading to no qualitative differences in our conclusions about 
the top SNPs. The prior odds for any given SNP's inclusion in a model 
are 8 times lower under the BetaBinomial(l/8, S) prior distribution than 
under to the BetaBinomial(l, S) prior distribution; the resulting SNP BFs 
are 2.8 times higher under the BetaBinomial(l/8, S) prior distribution than 
those under the BetaBinomial(l, S) prior distribution. As a result, eight more 
SNPs are above the 3.2 threshold used by the NCOCS to determine SNPs 
worthy of additional study. 

5.2. External Validation and Comparison. To provide a basis of com- 
parison, we applied the methods described in the simulation study (Section 
4) to the NCOCS data. We omitted stepwise logistic regression using AIC 



1 Ranks were used as residuals on the log scale still exhibited strong departures from 
normality. 
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because of its poor operating characteristics. The marginal FDR methods of 
Bonferroni and Benjamini-Hochberg failed to identify any significant SNPs. 
Lasso, which accounts for correlation among SNPS, also failed to identify 
any SNPS. Stepwise logistic regression using BIC selected a model with three 
of the top four SNPs identified by MISA — S1.G1, S3.G5 and S4.G4 — but 
failed to identify S2.G1, which has correlation 0.71 with SNP S1.G1. This 
highlights a problem with selection methods that ignore model uncertainty. 

The NCOCS proposed two SNPs — S10 and S14 in G9 — for external 
validation by the Ovarian Cancer Association Consortium (OCAC), a large 
international multi-center consortium of ovarian cancer case-control studies. 
The decision to focus on these variants was made on basis of results from 
an earlier version of the NCOCS data set and on basis of the strong prior 
interest NCOCS researchers had in the gene (and not on basis of the analysis 
described above). Under the default BetaBinomial(l, S) prior distribution, 
only SNP S10 in G9 exceeds the 3.2 threshold and the G9 BF is only 2.28. 
In contrast, under the BetaBinomial(l/8, S) prior distribution, both SNPs 
S10 and S14 (LD 0.62) in G9 have SNP BFs greater than 3.2 (8.70 and 5.99, 
respectively) and the gene BF is 6.18. An additional three SNPs in the same 
gene were proposed by another member of the consortium on the basis of 
uncorrected p-values. Of the five SNPs proposed for validation, only SNPs 
S10 and S14 were confirmed to be associated with serous invasive ovarian 
cancer by OCAC (Schildkraut et al, 2009). 

6. Discussion. In this paper, we describe MISA, a natural framework 
for multi-level inference with an implicit multiple comparisons correction for 
hypothesis based association studies. MISA allows one to quantify evidence 
of association at three levels: global (e.g. pathway-wide), gene, and SNP, 
while also allowing for uncertainty in the genetic parametrization of the 
markers. We have evaluated MISA against established, simple to implement 
and more commonly used methods and demonstrated that our methodology 
does have higher power than these methods in detecting associations in mod- 
estly powered candidate pathway case-control studies. The improvement in 
power is most noticeable for odds ratios of modest (real world) magnitude 
and comes at the cost of only a minimal increase in the false positive rate. 
Like stepwise logistic regression, lasso and logic regression, MISA improves 
upon marginal, SNP-at-a-time methods by considering multivariate ad- 
justed associations. By using model averaging, MISA improves upon these 
multivariate methods that select a single model, which may miss important 
SNPs because of LD structure. These improvements have concrete implica- 
tions for data analysis: MISA identified SNPs in the NCOCS data that were 
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subsequently externally validated; none of the less complex methods con- 
sidered here highlighted these SNPs to be of interest. Currently, other top 
ranked SNPs in genes identified by MISA are undergoing external valida- 
tion. Finally, we note that while MISA was developed for binary outcomes in 
case-control studies, MISA is readily adaptable to accommodate other forms 
of outcome variables (e.g. quantitative traits or survival) that are naturally 
modeled within a GLM framework. 

APPENDIX A: IMPLIED PRIOR DISTRIBUTION UNDER AIC 

Given that a closed-form expression for the marginal likelihood is not 
available for logistic regression, we have used the AIC to approximate the 
likelihood. In what follows, we determine a prior distribution on model co- 
efficients that is consistent with AIC. 

We assume a normal prior distributions on the d 7 -dimensional vector of 
regression coefficients (log odds ratios) of the form 



p(0 7 |M 7 ) ~ N fty, -D'^J 



where 3 7 is the observed Fisher information under model 3Vt 7 evaluated 
at the maximum likelihood estimates (MLEs) 6-y. Setting the covariance 
matrix to be proportional to the inverse Fisher information ensures that the 
correlation structure in the prior distribution matches that of the likelihood. 

In order to approximate the marginal likelihood we used a Laplace approx- 
imation based on expanding the log-likelihood in a second-order Taylor's 
series expansion about # 7 : 



£j(Q-y I 3VC-y) ~ £j(Q-y | JVt-y ) ~ (0 -y 0-y)^ 3-y{0-y ^-y) 



leading to the approximate marginal likelihood 
p{D | M 7 ) rj exp{£(0 7 | M 7 )}x 

f 1 1 

J (2vr)— K 

^) 2 exp^^lM^jA^,^^ 1 )}; 

where A^ (# 7 , 3" 1 ) = exp{— \(Q-y — # 7 ) r 3 7 (# 7 — 0-y)}. Setting this approx- 
imate \og(p(D | 3VC-y) equal to — 0.5AIC we have equality when the prior 
mean i 7 is set to # 7 where the right-most term vanishes, and k = cxp (2)_i • 



MULTILEVEL INFERENCE FOR SNP ASSOCIATION STUDIES 



25 



Roughly speaking, this implies that the prior standard deviation of any stan- 
dardized log odds ratio is about 2.5. This suggests that the approximation 
of the marginal likelihood under AIC is reasonable for prior distributions 
with mean zero, as this provides enough dispersion to cover the range of log 
odds ratios anticipated. 

APPENDIX B: MARGINAL BAYES FACTOR SCREEN 

We used Laplace approximations to estimate the marginal Bayes Factors 
(BFs) used to screen the SNPs (Kass and Raftery, 1995). In particular, we 
estimated the marginal likelihood of each of the three genetic models of 
association (log-additive, dominant and recessive) and under the null model 
(model of no genetic association). The BF for a model of association is 
defined as the ratio of the marginal likelihood of that model of association 
to the marginal likelihood of the null model. 

We accounted for missing genetic data by averaging marginal likelihoods 
over the M = 100 imputed genetic data sets. This affected only the cal- 
culations under the three genetic models of association, but not the null 
model. Hence the BF for an association was computed as the average of 
imputation-specific BFs. 

In the ovarian cancer analysis, the model for each SNP was a logistic 
regression for disease status given the variable age and the model-specific 
genotype variable. Age was included in all models, including the 'null' model 
of no association. The simulation models were unadjusted as no design or 
confounder variables were simulated. We placed a normal, mean zero, stan- 
dard deviation two prior on the parameter of the genetic effect variable and 
flat, improper priors on the remaining log odds ratio parameters. We ordered 
SNPs according to the maximum of the three Bayes factors and considered 
those with a maximum greater than or equal to one in the MISA model 
search. Our software for calculating marginal Bayes factors is included in 
the MISA R package. 

APPENDIX C: GENETIC SIMULATIONS 

We used simulated case-control data to compare MISA and other com- 
monly used procedures for genetic association studies. The simulated data 
sets were structured so as to reflect the details — genes, tag SNPs, LD 
structure, and sample size — of a NCOCS candidate pathway study com- 
prised of 53 genes tagged by 508 tag SNPs. Genotypes were simulated in 
two stages. First, for each of the 53 genes represented in the data set, we 
phased the NCOCS control SNP genotype data and estimated recombina- 
tion rates using PHASE (Stephens et al, 2001), which provides estimates 
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of the population haplotype distribution. Phase is a Bayesian method that 
obtains approximate samples from the posterior distribution of all possible 
haplotype pairs (H) given the observed genotypes (G) using Gibbs sampling 
and estimates recombination rates empirically from this sample. Second, 
given a model of association and the PHASE output, we generated case- 
control data at the selected tags using HAPGEN (Marchini and Su, 2006). 
Hapgen is a program that simulates haplotypes for a case-control sample of 
individuals given a set of population haplotypes and recombination rates for 
the regions of interest and choice of the hypothetical associated SNP and its 
allele-specific odds ratios. 

We generated 124 simulated data sets as follows. Ten of the simulations 
are null; there are no associations in the genes of interest. The remaining 114 
simulations assume that a randomly chosen subset of 9 genes are associated 
and that within the associated genes, a single, randomly chosen SNP is 
the source of the association. Within the 114 associated simulations, the 
associated tag SNPs were accorded an odds ratio (OR) of 1.25,1.5, 1.75, 2.0, 
or 2.25 and assumed to have either a dominant genetic parametrization, log- 
additive genetic parametrization or a recessive genetic parametrization. The 
marginal distribution over odds ratios is given in Figure 1. The marginal 
distribution over genetic models was uniform. The simulations used for the 
power analysis can be found at the URL for the software. 

We have also developed a software package, SimGbyE, that creates sim- 
ulated case/control or survival data sets with one or more of the follow- 
ing assumed effects: genetic main effects (G), environmental main effects 
(E), Gene by Gene interactions (GbyG), Gene by environment interactions 
(GbyE). The assumed genetic one and two locus models of epistasis are 
chosen randomly from a set of models chosen from Li and Reich (2000). 
Then given a set of assumed coefficients on the effects mentioned above, 
an outcome variable is simulated (case/control or survival) based on a set 
user specified distribution parameters. This package differs slightly from the 
method used to develop the simulations within this article by estimating the 
population haplotype distribution from HapMap instead of using PHASE 
to estimate the distribution from the set of control SNP genotypes in the 
NCOCS data. 

The main function calls Hapgen to simulate one replicate from a specified 
chromosomal region given data from one of the HapMap II populations. 
The code generates samples of genotypes in a contiguous range of DNA using 
Hapmap release 21 (NCBI build 35) data. The position range may encompass 
an entire chromosome or simply bracket a gene or locus of interest. That 
function can also be used to simulate data from multiple independent regions 
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to generate a candidate gene/pathway sample or a genome-wide sample. 
The default is to generate population-based genetic samples. However, to 
build genetic simulations with main effects only, parameters can be set so 
that Hapgen will randomly choose a variant in the specified region as the 
disease allele and generate a case-control sample. To build more complex 
associations we have written a wrapper function to take the genetic samples 
produced by Hapgen and simulate an outcome variable based on genetic 
main effects with multiple genetic parametrizations, environmental main 
effects, Gene by Gene interactions, and Gene by environment interactions. 
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WEB RESOURCES 

The URL for the software for the methodology and simulations presented 
in this paper is: 

http : //www. isds . duke . edu/gbye/packages .html . 
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